DOCUMENT RESUME 



ED 220 492                                        TM 820 505

AUTHOR          Bunch, Michael B.
TITLE           Using Non-Normed Tests in Title I Evaluation.
PUB DATE        Mar 82
NOTE            23p.; Paper presented at the Annual Meeting of the
                National Council on Measurement in Education (New
                York, NY, March 20-22, 1982).
EDRS PRICE      MF01/PC01 Plus Postage.
DESCRIPTORS     *Compensatory Education; Correlation; Criterion
                Referenced Tests; Elementary Secondary Education;
                *Evaluation Methods; Models; *Program Evaluation;
                Scores; Test Format; *Testing Problems; *Test Norms;
                Test Use
IDENTIFIERS     Elementary Secondary Education Act Title I; *Non
                Normed Tests; *RMC Models



ABSTRACT 

Research evidence relating to the utility of the RMC 
evaluation models of compensatory education employing non-normed 
tests is examined. The history and evolution of five early models 
into the current norm-referenced model utilizing a non-normed test 
(Model A2), non-normed versions of a comparison model (Model B2), and 
the regression model (Model C2) are described. Technical 
considerations of minimum correlation, score overlap and score 
distribution for Model A2 and the estimation of population standard 
deviations on the non-normed test for Models B2 and C2 are discussed. 
Practical considerations include group size, grade level, type of 
score used and instrument sensitivity. Problems and proposed 
solutions are discussed. The preference for non-normed tests is shown 
to stem from a perceived discrepancy in the sensitivity of tests to 
instructional objectives, so the task of comparing alternative 
implementation strategies for various models is considered. The
utility constraints of the Education Consolidation and Improvement
Act of 1981 are discussed. In the light of practical alternatives 
presented and seemingly insurmountable problems in Models A2, B2 and 
C2, it is recommended they be abandoned. (Author/CM) 



*********************************************************************
* Reproductions supplied by EDRS are the best that can be made      *
* from the original document.                                       *
*********************************************************************






USING NON-NORMED TESTS IN TITLE I EVALUATION



Michael B. Bunch 
Measurement Incorporated 



Paper presented at the annual meeting of the National Council on Measurement 
in Education, New York City, March, 1982. 




The U.S. Office of Education (now the U.S. Department of Education),
pursuant to provisions of the Education Amendments of 1974 (Public Law
93-380), developed three evaluation models for education programs funded by
Title I of the Elementary and Secondary Education Act of 1965 (Public Law
89-10). Each of these models has two versions. In version one of each model,
only norm referenced tests are used. In version two, both normed and
non-normed tests are used. These models (there are six in all), as well as the
supporting system (the Title I Evaluation and Reporting System, or TIERS), are
described in detail in a user's guide (Tallmadge and Wood, 1976). Explicit
reference to the evaluation models was made in the 1978 Education Amendments
(Public Law 95-561), and subsequent regulations required the use of these
models. Now, however, with passage of the Education Consolidation and
Improvement Act of 1981 (Public Law 97-35; Title V, Subtitle D), the federal
role in evaluation has become one of providing nonbinding guidelines.

In light of the changing role of the federal government in the evaluation
of compensatory education programs, it seems appropriate to examine these
models. The specific purpose of this paper is to highlight research evidence
relating to the utility of the versions of the models employing non-normed
tests. A related objective is to provide some guidance to those individuals
at the national level who will establish the nonbinding guidelines and to the
individuals at the state and local levels who will have to choose among
several evaluation strategies.

This paper traces some of the historical developments of the various
evaluation models currently used in Title I evaluation. In addition, some of
the problems encountered in the application of these models are discussed
along with some of the solutions that have been proposed. Finally,
recommendations relating to the relative appropriateness of the various models
are given.




Historical Perspective 



In the early stages of development of the current evaluation models, there
were actually five models (U.S. Department of Health, Education and Welfare,
undated). These models are briefly described as follows:

o Model 1 - Posttest Comparison with Matched Groups. "This model
  requires that children be paired in terms of pretest measures and that
  one member of each pair be randomly assigned to the treatment group
  and the other to the comparison group." (Ibid. p. 49).

o Model 2 - Analysis of Covariance. Analysis of covariance provides an
  appropriate statistical adjustment to compensate for pretest score
  differences between groups if these differences were due to such
  chance factors as random sampling fluctuations. (Ibid. p. 54)

o Model 3 - Special Regression Models. This was actually two regression
  models, one based on the regression projection model (Tallmadge and
  Horst, 1974), the other based on the regression discontinuity model
  (Campbell and Stanley, 1963).

o Model 4 - General Regression Model. This model is actually more
  similar to the analysis of covariance model. Essentially, several
  independent variables may be used to develop a regression equation to
  predict posttest scores.

o Model 5 - Normed Referenced Model. "Project children are compared to
  a norm group usually comprised of a nationally representative sample
  of children at the same grade level. The no-treatment expectation is
  that the project pupils will maintain, at posttesting, the same
  achievement status with respect to the norm group as they had at
  pretesting." (U.S. Dept. of HEW, p. 72).

In the description of these models, there is no reference to the use of
non-normed tests. It is apparent that the intent of the model developers was
to use standardized tests only. However, upon completion of extensive field
testing of the five evaluation models and discussions with state and local
education officials in every state in the country, developers reduced the
number of models to three and added a non-normed version to each model
(Gamel, Tallmadge, Wood, and Binkley, 1975).

Thus, the Title I Evaluation and Reporting System that exists today is
based primarily on a compromise between technical excellence and political
reality. The posttest comparison with matched groups (Model 1) and analysis
of covariance (Model 2) have been combined to form the comparison group model
(Model B). The Campbell and Stanley version of the regression model (i.e., the
regression discontinuity model) has been dropped in favor of the Tallmadge and
Wood regression projection model and is currently known as the special
regression model (Model C). The generalized regression model (Model 4) was
dropped entirely, and the normed referenced model (Model 5) has survived as
the norm referenced model (Model A). Furthermore, each model now allows for
the use of non-normed tests. This allowance is clearly the result of input
from state and local education officials.

Within each of the six evaluation strategies, all program effects are
described in terms of Normal Curve Equivalents (NCEs). This metric consists
of a standard score scale with a mean of 50 and a standard deviation of
21.06. The scale was constructed so that the NCE value and percentile value
would be the same at 1, 50, and 99. Other aspects of NCEs are described by
Tallmadge and Wood (1976).
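
The NCE scale just described can be illustrated with a minimal sketch. It
assumes only what the text states (mean 50, standard deviation 21.06) plus
the usual normal-curve relation between percentile rank and standard score.
The function names are illustrative and do not come from the TIERS user's
guide.

```python
from statistics import NormalDist

NCE_MEAN = 50.0
NCE_SD = 21.06  # standard deviation of the NCE scale as given in the text


def percentile_to_nce(percentile: float) -> float:
    """Convert a percentile rank (strictly between 0 and 100) to an NCE."""
    if not 0.0 < percentile < 100.0:
        raise ValueError("percentile must lie strictly between 0 and 100")
    z = NormalDist().inv_cdf(percentile / 100.0)  # normal deviate
    return NCE_MEAN + NCE_SD * z


def nce_to_percentile(nce: float) -> float:
    """Convert an NCE back to a percentile rank."""
    z = (nce - NCE_MEAN) / NCE_SD
    return 100.0 * NormalDist().cdf(z)


if __name__ == "__main__":
    # NCE and percentile agree closely at 1, 50, and 99 and diverge between.
    for p in (1, 25, 50, 75, 99):
        print(p, round(percentile_to_nce(p), 1))
```

Because NCEs form an equal-interval scale while percentiles do not, pretest
and posttest NCEs can be subtracted to express a gain.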

In the non-normed versions of each model, score gains are estimated by
linking a normed test to a non-normed test. In the version of the norm
referenced model which utilizes a non-normed test (Model A2), the two tests
are linked through an equipercentile equating procedure, and gains on the
non-normed test are translated into estimated normed test gains. In the
non-normed versions of the comparison model (Model B2) and the regression
model (Model C2), gains are expressed in terms of the hypothetical
distribution of scores on the non-normed tests. Estimation of population
parameters, specifically standard deviations, becomes the key technical issue
in the use of Models B2 and C2. While applications of Model A2 have been
fairly common in many states, the use of Models B2 and C2 is extremely rare.

Problems Encountered 

Problems encountered in the use of Models A2, B2, and C2 divide fairly
neatly into two categories: practical and technical. Many of the practical
problems described below are experienced in all three of the models. Some of
them (for example, testing at or near the empirical norming date) are also
observed with the normed versions of each of the models. The technical
problems encountered, however, clearly divide the models into two groups:
Model A2 on the one hand, and Models B2 and C2 on the other. Since the
technical problems for Models B2 and C2 are so similar, these two models are
grouped together throughout the remainder of this paper. Any discussion of
problems or proposed solutions for either of these two models should be
considered appropriate for the other. Discussions of proposed solutions for
Model A2 should not be generalized to Models B2 and C2 unless specifically
noted otherwise.

Before describing the problems encountered with Model A2, perhaps it would
be helpful to look at the way in which Model A2 developers intended for it to
be implemented. In Model A2, the following situation is typically found. An
evaluator tests pre and post with a non-normed test (either a locally
developed or a commercially available criterion referenced test) and
administers a normed test either as a pretest or as a posttest. The exact
steps to be carried out (assuming a normed pretest) are as follows:

 1. Administer at pretest time non-normed and nationally normed tests,
    according to normative data points for the normed test.

 2. Obtain the correlation between the normed and the non-normed test for
    the population. If the correlation is less than .60, use Model A1.

 3. Determine the median pretest raw score for the normed test.

 4. Determine the national percentile from the pretest norms table which
    corresponds to the median raw score, representing the expected no
    treatment effect.

 5. Convert the no treatment percentile to an NCE, representing the
    expected no treatment effect.

 6. Administer the identical non-normed test at posttest time, according
    to normative data points for the normed test (that is, at or near the
    empirical norming date of the normed test).

 7. Determine the median posttest raw score for the non-normed test.

 8. Convert the median posttest raw score to a pretest percentile (i.e.,
    determine how many students scored below that point at pretest time).

 9. From pretest norms, find the normed test raw score corresponding to
    the percentile obtained in step eight.

10. From posttest norms, find the normed test percentile corresponding to
    the raw score obtained in step nine.

11. Convert this percentile to an NCE.

12. Subtract the results of step five from the results of step 11. This
    is the observed Title I effect.

This process is referred to as equipercentile equating at the median
only. The same process may be applied, with some modifications, if the normed
test is administered as a posttest. The technical problems associated with
the implementation of Model A2 have to do with the correlation between the
norm referenced test and the non-normed test, the overlap of scores from
pretest to posttest, and what are commonly referred to as floor and ceiling
effects in either the non-normed test or the normed test. Other, practical
problems may also arise. These practical problems involve group size, grade
level, instrument sensitivity, and type of score used (either raw score or
mastery score on the non-normed test). The research relevant to each of these
issues is presented below along with some of the proposed solutions.
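
The Model A2 steps listed earlier (normed pretest case) can be sketched in
code. This is an illustration under stated assumptions, not the procedure as
published in the user's guide: the norms tables are hypothetical
dictionaries mapping raw scores to national percentiles, step nine is read
as a lookup in the local distribution of normed pretest scores, and the low
median is used so that every lookup hits an observed raw score.

```python
from statistics import NormalDist, median_low


def percentile_to_nce(percentile):
    """Map a percentile rank (0 < p < 100) onto the NCE scale."""
    return 50.0 + 21.06 * NormalDist().inv_cdf(percentile / 100.0)


def local_percentile(score, scores):
    """Percentage of local scores falling strictly below `score`."""
    return 100.0 * sum(s < score for s in scores) / len(scores)


def local_score_at(percentile, scores):
    """Local score located at `percentile` (inverse of local_percentile)."""
    ordered = sorted(scores)
    # The small epsilon guards against floating point round-off.
    index = int(percentile / 100.0 * len(ordered) + 1e-9)
    if index >= len(ordered):
        raise ValueError("posttest median exceeds every pretest score")
    return ordered[index]


def model_a2_gain(normed_pre, nonnormed_pre, nonnormed_post,
                  pre_norms, post_norms):
    """Estimated NCE gain; pre_norms/post_norms are hypothetical tables."""
    # Steps 3-5: no treatment expectation from the normed pretest median.
    expected_nce = percentile_to_nce(pre_norms[median_low(normed_pre)])
    # Steps 7-8: locate the non-normed posttest median in the local
    # non-normed pretest distribution.
    pct = local_percentile(median_low(nonnormed_post), nonnormed_pre)
    # Step 9 (assumed reading): normed raw score at that same local
    # percentile in the normed pretest distribution.
    equated_raw = local_score_at(pct, normed_pre)
    # Steps 10-11: national posttest percentile for that raw score, as NCE.
    observed_nce = percentile_to_nce(post_norms[equated_raw])
    # Step 12: observed minus expected.
    return observed_nce - expected_nce
```

The `ValueError` corresponds to the case in which no pretest score reaches
the posttest median, so that step eight cannot be completed.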

Although the procedures described above for the implementation of Model A2
do not use the coefficient of correlation between the normed and non-normed
tests at any point, a low correlation casts extreme doubt on the usefulness of
the evaluation results. As noted previously, the model developers recommend
a minimum correlation of .60 (Tallmadge and Wood, 1976). Even when the
minimum correlation of .60 is obtained, the two tests share only 36% common
observed variance. Several investigators have shown that even this minimum
correlation of .60 may be very difficult to obtain under normal circumstances
(cf. Storely, Rice, Harvey, and Crane, 1979; Bunch and Dixon, 1980; Kahn and
Overton, 1980).
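
The 36% figure follows from squaring the correlation coefficient, since the
proportion of observed variance two measures share is $r^2$. As a short
worked calculation (the symbol $r$ is introduced here for illustration):

$$
r = .60 \quad\Rightarrow\quad r^2 = (.60)^2 = .36, \qquad 1 - r^2 = .64 .
$$

In other words, at the recommended minimum, 64% of the observed variance in
one test is not accounted for by the other.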

A somewhat more subtle technical problem has to do with the overlap of
scores from pretest to posttest of the non-normed tests. Specifically, step
eight of the implementation procedures requires that the median posttest raw
score be converted to a percentile on the pretest score distribution. If the
average student obtains a posttest raw score higher than that of the highest
scoring student on the pretest, then Model A2 cannot be implemented. This
situation is illustrated below in Figure 1.

[Figure 1 not reproduced; vertical axis: Frequency.]

Figure 1. Hypothetical distribution of pretest and posttest scores on a
non-normed test.
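
A simple check for the overlap condition can be sketched as follows. The
function name and the returned fields are hypothetical; the logic only
restates step eight, namely that the posttest median must fall within the
range of observed pretest scores.

```python
from statistics import median


def overlap_check(pretest, posttest):
    """Report whether the posttest median can be located on the pretest scale."""
    post_median = median(posttest)
    below = sum(score < post_median for score in pretest)
    return {
        "post_median": post_median,
        # Percentage of pretest scores below the posttest median (step 8).
        "pretest_percentile": 100.0 * below / len(pretest),
        # Equating at the median fails when every pretest score lies below.
        "equatable": below < len(pretest),
    }


if __name__ == "__main__":
    print(overlap_check([1, 2, 3, 10], [3, 4, 5, 6]))
    print(overlap_check([1, 2, 3, 4], [6, 7, 8, 9]))
```

Running such a check before equating would flag data sets of the kind shown
in Figure 1.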

The problem of score overlap has been discussed in a monograph produced
for the U.S. Department of Education by RMC Research Corporation (U.S.
Department of Education/RMC Research Corporation, 1981) as well as by Gamel
(Note 1) and Bunch and Dixon (1980). This problem is closely related to the
problem of choice of score (i.e., total raw score vs. number of objectives
mastered). It can be shown, for example, that in choosing the objectives
mastered indicator as the pretest and posttest score, the range at both
pretest and posttest times will be extremely restricted. If, on the other
hand, raw scores are used, there is more likely to be a spread of scores at
both the pretest and the posttest. However, even when raw scores are used,
the overlap between pretest scores and posttest scores is likely to be
minimal if instruction was effective (cf. Popham, 1978).

One problem frequently found in all types of program evaluation is the
problem of test floor or ceiling effects. Floor effects are those effects
observed when the test administered to students is too difficult.
Consequently, most students receive very low scores. Additionally, when the
test is a multiple choice test (as are nearly all normed tests), the average
score may approach the score that might be obtained by chance guessing. The
mean score is thus too high (not too low, as some might contend). Ceiling
effects, on the other hand, occur when a test is given that is too easy for
the achievement or ability level of the student population to which it is
administered. Since the upper limit of achievement is not tapped, mean scores
are too low.

In the case of Model A2, floor and ceiling effects present a double
problem. That is, not only does one have to worry about floor and ceiling
effects on the norm referenced test, but one must deal with these effects in
the non-normed test as well. Specifically, when one uses only two tests as in
Model A1 (a norm referenced test administered at pretest and at posttest
time), there are nine possible combinations of floor and ceiling effects. Of
these nine combinations, only one will yield an appropriate estimate of the
gain score. When three tests are used, the number of possible combinations of
floor and ceiling effects is 27. Of these 27 possibilities, only one will
yield an appropriate estimate of the gain.
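
The counts of nine and 27 can be reproduced by enumeration. The sketch
assumes, as the paragraph implies, that each test administration is in one
of three states (floor effect, ceiling effect, or neither) and that only the
combination with no effect on any administration yields an appropriate gain
estimate. The names are illustrative.

```python
from itertools import product

STATES = ("floor", "none", "ceiling")


def effect_combinations(n_administrations):
    """All floor/ceiling patterns across the given number of administrations."""
    return list(product(STATES, repeat=n_administrations))


def clean_combinations(combinations):
    """Patterns in which no administration shows a floor or ceiling effect."""
    return [c for c in combinations if all(state == "none" for state in c)]


if __name__ == "__main__":
    for n in (2, 3):  # Model A1 uses two administrations, Model A2 three
        combos = effect_combinations(n)
        print(n, len(combos), len(clean_combinations(combos)))
```

The number of patterns is 3 raised to the number of administrations, so each
added test triples the ways in which the estimate can go wrong.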

Brummet and Masters (1980) presented data illustrating the problem that
occurs when a ceiling effect is observed on the non-normed test at posttest
time. Their data showed that under these conditions, Model A2 systematically
underestimated the size of gain for the Title I program. Their sample focused
on a single project in a school district where it was later observed that
evaluations for other projects were seriously flawed with both floor and
ceiling effects.

Crane, Prapuolenis, Rice, and Perlman (1981) used computer generated data
to test the effects of various model violations on the outcomes of Model A2.
One of the conditions considered was extremely positively or negatively skewed
score distributions. Their skewed score distributions correspond very closely
to the distributions of scores observed when floor or ceiling effects occur.
A major finding of the study by Crane et al. (1981) was that "when Model A2
is applied with CRT data as negatively skewed as observed in Chicago Title I
CRT data, all of the equating procedures examined will result in considerably
biased NCE gain estimates." (p. 4)

In 1978, the U.S. Office of Education/Office of Planning, Budget, and
Evaluation requested the formation of a national committee to examine the norm
referenced evaluation model (Model A). This committee contained a
subcommittee to examine problems associated with Model A2. In October of that
year, the subcommittee on Model A2 made the following recommendations
regarding the implementation of Model A2:

(1) the TACs (Technical Assistance Centers) should provide for a
    hierarchy of strategies for analyzing non-normed test data that will
    allow LEAs of varying size and technical sophistication to select a
    particular strategy. A tentative hierarchy was proposed by the
    subcommittee:

    a. Develop normalized test score distributions for the non-normed and
       normed tests at the local level.

    b. Use a curvilinear analog for equating tests across the entire score
       range as described in Angoff (1971).

    c. Use a linear analog for equating tests as described in Angoff (1971).

    d. Use the current A2 strategy; however, if this alternative is chosen,
       it is recommended that the TACs make efforts to assist LEAs in: (1)
       the development or selection of a non-normed test, and (2) the
       selection of an appropriate level of a normed test in order to avoid
       the possible effects of score range distribution (Hansen, 1978, pg.
       24).
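
For readers unfamiliar with the linear analog mentioned in item c, the
following sketch shows the common textbook form of linear equating, in which
a score is carried from one test to the other by matching means and standard
deviations. It is an assumption that this form corresponds to the procedure
the subcommittee had in mind; the exact formulation in Angoff (1971) is not
reproduced in this paper, and the function name is illustrative.

```python
from statistics import mean, pstdev


def linear_equate(score, from_scores, to_scores):
    """Place `score` from one test on the scale of another test.

    The equated value has the same standard score in `to_scores`
    as `score` has in `from_scores`.
    """
    from_sd = pstdev(from_scores)
    if from_sd == 0:
        raise ValueError("source scores have no variance")
    slope = pstdev(to_scores) / from_sd
    return mean(to_scores) + slope * (score - mean(from_scores))


if __name__ == "__main__":
    crt = [12, 15, 18, 21, 24]   # hypothetical non-normed raw scores
    nrt = [30, 35, 40, 45, 50]   # hypothetical normed raw scores
    print(linear_equate(21, crt, nrt))
```

Unlike equipercentile equating at the median only, this approach uses the
whole score distribution, but it assumes the two distributions differ only
in location and spread.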

The previously cited study by Crane et al. (1981) was a direct response to
the Model A2 Subcommittee Report. Crane et al. investigated the effects of
several different levels of NRT-CRT correlation, levels of sample size, and
levels of treatment effect on the amount of bias introduced into NCE gain
estimates obtained by the four different equating procedures described above.
Data in the Crane et al. study were computer simulated.



Among the findings presented by Crane et al. were several analyses
comparing the four equating methods across treatment conditions, sample size,
and correlation level. In virtually every instance, all of the procedures
proposed by Hansen's group produced smaller errors than those produced by the
standard Model A2 procedures described earlier in this paper.

Bunch and Dixon (1980) carried the Crane et al. analyses a bit further by
allowing for both forward equating and backward equating in Model A2.¹
Bunch and Dixon examined seven methods of estimating gains. These seven
methods are presented below:

Method 1 - Standard Model A2, predicting posttest from pretest

Method 2 - Standard Model A2, predicting pretest from posttest

Method 3 - Regression Method, predicting posttest from pretest

Method 4 - Regression Method, predicting pretest from posttest

Method 5 - Linear Equating, predicting posttest from pretest

Method 6 - Linear Equating, predicting pretest from posttest

Method 7 - Standard Model A1; the standard of comparison for Methods 1
           through 6.

The equations for Methods 1 through 6 are described in detail by Bunch and
Dixon (1980).

Bunch and Dixon examined two separate data bases in the comparison of
these equating methods. The first data base came from a Virginia school
division where TIERS Models A1 and A2 had been simultaneously implemented.
This data base contained some of the problems discussed previously (for
example, low correlation and lack of score overlap).

¹ The reader will recall that the equating in Model A2 may be done either at
pretest or at posttest. When one equates at pretest, one predicts the
posttest scores. When one equates at posttest, one predicts pretest scores.

The Virginia data base consisted of pretest and posttest scores for 138
students. Each student had taken one of seven levels of a locally developed
criterion referenced test in the fall of 1978 and again in the spring of
1979. Additionally, an appropriate level of the SRA Achievement Battery
(Science Research Associates, 1974) was administered at both testings. Each
student thus had four scores: normed pretest, normed posttest, non-normed
pretest, and non-normed posttest.

Because seven levels of the CRT were administered, there were fourteen
CRT-NRT correlations (seven pre, seven post). Of these fourteen, only one
reached the minimum acceptable level of .60. Thus, Methods 1 and 2 became
technically infeasible and, for all practical purposes, Methods 3 through 5
became infeasible as well. However, where the correlation between normed and
non-normed tests was relatively high, estimates based on correlations (Methods
3 and 4) were generally fairly accurate. The regression methods appeared to
work better overall than linear equating methods or the standard Model A2 when
the results from Model A1 were used as the criterion.

The second data base was that used by Pellegrini, Horwitz, and Long
(1979). In that study, two standardized tests had been given to groups of
third graders (N = 64), fourth graders (N = 55), fifth graders (N = 58), and
sixth graders (N = 81). Each student had been administered an appropriate
level of the California Achievement Test (CAT) Form C and the Comprehensive
Test of Basic Skills (CTBS) Form S on two separate occasions. Thus, each
student in the sample had four standardized test scores.


These data were chosen because either of the two normed tests could be
treated as a non-normed test, and Methods 1 through 7 could be applied
accordingly. With the CAT-CTBS data, there was no clearly superior method of
equating. In some instances, the forward regression method seemed to work
best, while in other instances the standard Model A2 predicting posttest from
pretest seemed to work better.

One finding of the second study was a set of discrepancies between the gain
estimates produced by the California Achievement Test and the Comprehensive
Test of Basic Skills. Since both of these tests are norm referenced, this may
sound like a problem of test selection for Model A1. However, the pattern of
discrepancies was similar to the pattern of discrepancies in gains when one
compares norm referenced tests and criterion referenced tests.

In most instances, the norm referenced test will produce a greater
dispersion of scores than will the criterion referenced or non-normed tests.
In this specific instance, because large raw score gains on the CAT are
required to produce modest NCE gains, such gains translate into extremely
large gains on the CTBS, where relatively small gains are capable of producing
NCE gains. Similarly, a small score gain on a criterion referenced test may
translate into a relatively large NCE gain on the standardized or norm
referenced test. One is forced to wonder whether or not this particular
feature of Model A2 has been its primary selling point over the past several
years.

Some of the problems previously discussed may have been exacerbated by the
relatively small size of the sample involved. Gamel (Note 1) analyzed several
studies and concluded that a sample size of 300 would be adequate for
implementation of Model A2. Gamel went on to suggest that Model A2 is a good
model for aggregation at the state or federal level. However, particularly at
the local level, a sample size of 300 seems unlikely.



With respect to grade level, Gamel (Note 1) found that gains become more
stable at the higher grades. It should also be pointed out, however, that
gains also tend to decrease at the higher grades (see Note 2). Thus, Model A2
may appear to be a suitable model at the higher grade levels with fairly large
sample sizes. Again, one runs into the practical problem that there are fewer
and fewer students involved in Title I as they progress through elementary
school and into junior high school and high school (see Note 2). In other
words, where Model A2 appears to work best, there appear to be the fewest
students on whom it may be used.

Related to the technical problem of low NRT-CRT correlation is the
practical problem of differential instrument sensitivity. This problem was
discussed by ^ish (1979) and could easily account for some of the results
found by Linn (1979). This practical problem seems to be at the very heart of
the decision to use Model A2. Because Model A2 requires the administration of
three tests rather than two, there must be some practical advantage to using
it. Most users of Model A2 contend that the criterion referenced test is more
sensitive to their instructional program than their standardized norm
referenced test (cf. Note 1). In these instances, it is very frequently the
case that the standardized norm referenced test is one used by the entire
school district for general testing purposes. This test may or may not be
appropriate to the Title I objectives and practices. The non-normed test, on
the other hand, is quite frequently developed specifically for the Title I
program or by Title I staff in conjunction with other school staff for
specific instructional objectives taught by Title I teachers and regular
classroom teachers alike.


That tests do not all measure exactly the same thing is amply pointed out
by Porter, Schmidt, Floden, and Freeman (1978). They show that there are
major differences in content among commonly used standardized norm referenced
tests. How much greater then is the content difference between nationally
produced, standardized, norm referenced tests and locally produced, criterion
referenced tests?

Indeed, if the scenario just mentioned is fairly widespread, then it would
seem that Model A2 would be in use in a very biased sample of school districts
across the nation. That is, where the locally used norm referenced test
appears to Title I evaluators to be sensitive to the Title I and regular
classroom objectives, Model A1 will be used. Where the district-wide test
does not appear to be content valid to the Title I evaluator, Model A2 is more
likely to be used.

However, Title I program effects are expressed in terms of gains on the
norm referenced tests. Thus, gains are being expressed on a fairly
insensitive measure. The more insensitive the measure, the more likely it is
that Model A2 will be used. At the same time, the more insensitive the norm
referenced test, the lower the correlation between normed tests and non-normed
tests is likely to be. Model A2 would therefore appear valid only in those
cases where it is not greatly needed.

The problems and associated research for Models B2 and C2 are less
extensive than those of Model A2. All reports to date have focused on the
issue of estimating the population standard deviation on a non-normed test,
the main underlying concept of the two models. Long, Horwitz, and DeVito
(1978) showed that the standard procedure for estimating the population value
of the standard deviation on the non-normed test ($\sigma_{nn}$) is
systematically biased. Specifically, this estimate was systematically high
when the wrong level of the normed test was used and systematically low when
the non-normed test score range was restricted. Since Title I test scores are
quite likely to be very restricted, it seems reasonable to expect that
$\sigma_{nn}$ would usually be underestimated.

Bunch (1979) proposed a solution to the estimation of $\sigma_{nn}$. This
solution was based on previous work by Cronbach (1971) and relies on sample
correlations and sample standard deviations. This proposed solution was
subsequently criticized by Pellegrini, Long, and Horwitz (1979). Pellegrini
et al. argued that the solution proposed by Bunch also systematically
underestimated $\sigma_{nn}$ and thus systematically overestimated the Title I
effect. Bunch (1979) developed an alternate formula, again based on work by
Cronbach, but this time using sample correlations between normed and
non-normed tests corrected for attenuation. This alternative produced more
reasonable estimates of the population standard deviation on the non-normed
tests and thus better estimates of the Title I effect. One serendipitous
finding of the Bunch study was that scale scores are far superior to raw
scores for this type of estimation (Tallmadge and Wood (1976) had recommended
using raw scores). In every instance, whether the original formula for the
estimation of the population standard deviation or either of the
alternatives was used, the estimate was consistently more accurate when scale
scores, rather than raw scores, were used.
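
The correction for attenuation mentioned above can be illustrated with the
classical formula, which divides an observed correlation by the square root
of the product of the two tests' reliabilities. This sketch shows only that
correction; it is not Bunch's formula for the population standard deviation,
which is not given in this paper. The function and parameter names are
hypothetical.

```python
from math import sqrt


def disattenuate(r_xy, reliability_x, reliability_y):
    """Classical correction of an observed correlation for unreliability."""
    for reliability in (reliability_x, reliability_y):
        if not 0.0 < reliability <= 1.0:
            raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / sqrt(reliability_x * reliability_y)


if __name__ == "__main__":
    # Hypothetical values: observed NRT-CRT correlation and two reliabilities.
    print(round(disattenuate(0.48, 0.80, 0.80), 2))
```

Because reliabilities are at most 1, the corrected correlation is never
smaller than the observed one.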

There has been virtually no investigation of Models B2 and C2 since 1979.
Furthermore, there appears to have been little if any reaction to the
proposed solutions offered at that time. However, as noted earlier, use of
Models B2 and C2 is quite rare or perhaps non-existent.



Conclusions and Recommendations

The utility of Models A2, B2, and C2 has been examined from the
perspectives of technical adequacy and practicality. The technical issues
have focused on minimum correlation, score overlap, and score distribution for
Model A2 and the estimation of population standard deviations on the
non-normed test for Models B2 and C2. The practical considerations have
included group size, grade level, type of score used, and instrument
sensitivity. It is on this last issue that the distinction between practical
and technical concerns becomes blurred.

In the final analysis, the utility of any model is relative. Since the
driving force behind the selection of Models A2, B2, and C2 seems to be a
preference for non-normed tests, and since a great deal of this preference
seems to stem from a perceived discrepancy in the sensitivity of tests to
instructional objectives, one is faced with a task not simply of comparing
models but of comparing alternate implementation strategies for various
models. For example, during the period under study (roughly 1974 to 1981)
most major test publishers have re-normed their standardized achievement
tests. As older versions of these tests are removed from the market place,
local evaluators have been faced with the task of selecting different tests or
perhaps purchasing the newer version of the old tests. Having examined the
instructional sensitivity of the old tests, many evaluators have lobbied
strongly in their districts for the selection of more instructionally
sensitive norm referenced tests. Indeed, the recommendation to do so has been
at the heart of much of the technical assistance provided by the Title I
evaluation Technical Assistance Centers.

As more and more school districts adopt norm referenced tests that are
highly instructionally sensitive, the motivation for using non-normed tests in
Title I evaluation will diminish. This phenomenon has been observed in many
school districts served by the author throughout HEW (now ED) Region III
(Pennsylvania, Delaware, Maryland, District of Columbia, Virginia, and West
Virginia). Indeed, the number of applications of Model A2 in Virginia, for
example, has diminished by approximately 75 per cent (from 22 down to 3) over
the last three years. Most evaluators who have dropped Model A2 cited
practical considerations.

Perhaps the most damaging blow to the non-normed models has been the
emphasis in the 1978 amendments and the 1981 legislation on sustained
effects. The Education Consolidation and Improvement Act of 1981 stresses
evaluation that covers a period of time of at least one calendar year. This
emphasis presents severe constraints for the non-normed models. These models
have typically been used in fall-spring testing programs. A typical case
would involve the administration of the normed and non-normed pretest in the
fall and a non-normed posttest in the spring for Model A2. With the emphasis
on twelve-month evaluations, many districts have adopted a spring-spring
testing schedule. Given this testing schedule, it is necessary to administer
a norm referenced test each spring. To administer the non-normed test and
then estimate scores on the subsequent spring normed test seems somewhat
ludicrous when the actual scores on that test are available. Thus, even where
Model A2 might seem feasible, one is forced to weigh its feasibility against
the imposition of additional testing. In this situation, once-a-year testing
has won out more often than not.
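
The redundancy described above can be illustrated with a minimal sketch. It
assumes, purely for illustration, a linear (mean and standard deviation)
linking between the non-normed and normed fall pretests; this paper does not
specify that procedure here, and the function name `linear_link` and the
sample scores are hypothetical.

```python
from statistics import mean, pstdev


def linear_link(non_normed_pre, normed_pre):
    """Build a linear mapping from the non-normed scale to the normed scale.

    Assumption (not taken from the paper): the two fall pretests are linked
    by matching their means and standard deviations.
    """
    m_x, s_x = mean(non_normed_pre), pstdev(non_normed_pre)
    m_y, s_y = mean(normed_pre), pstdev(normed_pre)
    if s_x == 0:
        raise ValueError("non-normed pretest scores have no spread")
    slope = s_y / s_x

    def to_normed_scale(score):
        # Project a non-normed score onto the normed test's scale.
        return m_y + slope * (score - m_x)

    return to_normed_scale


# Hypothetical fall pretest scores for the same group of students.
fall_non_normed = [10, 20, 30]
fall_normed = [30, 40, 50]
link = linear_link(fall_non_normed, fall_normed)

# Hypothetical spring non-normed posttest scores, projected to the normed scale.
spring_non_normed = [20, 30, 40]
estimated_spring_normed = [link(s) for s in spring_non_normed]
print(estimated_spring_normed)
```

Under a spring-spring schedule the normed scores are observed directly each
spring, so a projection of this kind adds a layer of estimation without adding
information.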

Less testing which serves the same purpose as more testing would obviously 
seem to have greater utility. At the same time, if two tests measure 
instructional objectives equally well and one has the added advantage of 
national norms, the test with national norms would seem to have a greater 
utility.



When Models A2, B2 and C2 were first introduced, there was little emphasis on
once a year testing and a great deal of making do with whatever norm
referenced test happened to be available. Thus, the availability of more 
appropriate norm referenced tests and the emphasis upon full year evaluations 
have worked together to undermine the relative utility of non-normed models. 

The measurement of change or growth will forever be fraught with problems
(cf. Harris, 1963). In every instance, our best estimate of the amount of
growth produced by a particular program or project is simply that, an
estimate. With non-normed models, we end up with estimates of estimates. In
light of practical alternatives to Models A2, B2, and C2, and given the
seemingly insurmountable problems inherent in these models, it seems totally
appropriate to advocate abandoning them as we approach a new era in the
evaluation of compensatory education programs.






REFERENCE NOTES 



Nona N. Gamel. The adequacy of Model A2. Draft report prepared for
the Department of Health, Education & Welfare, U.S. Office of
Education/Office of Evaluation and Dissemination by RMC Research
Corporation, Mountain View, California, April 1980.

RMC Research Corporation. Preliminary reports to Office of Planning
and Evaluation, U.S. Department of Education, 1982.







REFERENCES 

Angoff, W.H. Scales, norms, and equivalent scores. In R.L. Thorndike (Ed.),
Educational measurement. Washington, D.C.: American Council on
Education, 1971.

Brumet, M., & Masters, B. Floor and ceiling effects with Model A1 and A2. In
Lewis, J.S. (moderator), On the uses of criterion referenced tests in
Title I Evaluation: A time and place for everything. Symposium presented
at the annual meeting of the Eastern Educational Research Association,
Norfolk, Virginia, March 1980.

Bunch, M.B. Linking normed and non-normed tests for evaluation of Title I
programs. Paper presented at the annual meeting of the Eastern 
Educational Research Association, Kiawah Island, South Carolina, February 
1979. 

Bunch, M.B., and Dixon, R. Linking Normed and Non-normed Tests in TIERS Model
A2: Sometimes you can't get there from here. In Lewis, J.S. (moderator),
On the uses of criterion referenced tests in Title I Evaluation: A time
and place for everything. Symposium presented at the annual meeting of
the Eastern Educational Research Association, Norfolk, Virginia, March
1980.

Campbell, D.T., & Stanley, J.C. Experimental and quasi-experimental designs
for research on teaching. In N.L. Gage (Ed.), Handbook of research on
teaching. Chicago: Rand McNally, 1963.

Crane, L.R., Prapuolenis, P.G., Rice, W.K., & Perlman, C. The effect of
different equating methods on Title I evaluation Model A2 NCE gain
estimates: Final Report (USDE Contract #300-79-0485). Evanston,
Illinois: Educational Testing Service, 1981.

Fish, O.W. An analysis of the evaluation data when ESEA, Title I evaluation
models A1 and A2 are empirically field tested simultaneously. Paper
presented at the Annual Meeting of the American Educational Research
Association, San Francisco, California, April 1979.


Gamel, N., Tallmadge, G.K., Wood, C.T., & Binkley, J.L. State ESEA Title I
Reports: Review and Analysis of past reports, and development of a model
reporting system and format (report prepared for the U.S. Department of
Health, Education and Welfare). Mountain View, California: RMC Research
Corporation, 1975.

Hansen, J.B. Report of the Committee to Examine Issues Related to the Use of
the Norm Referenced Model for Title I Evaluation. Northwest Regional
Educational Laboratory, October 1978.

Kahn, L., & Overton, W. Model A2 from a practical point of view: Which way
to go? In Lewis, J.S. (moderator), On the uses of criterion referenced
tests in Title I evaluation: A time and place for everything. Symposium
presented at the annual meeting of the Eastern Educational Research
Association, Norfolk, Virginia, March 1980.


Linn, R.L. Validity of inferences based on the proposed Title I evaluation
models. Educational Evaluation and Policy Analysis, 1, 15-22.

Long, J., Horwitz, S., and DeVito, P. An empirical investigation of the ESEA
Title I Evaluation System's proposed variance estimation procedures for
use with criterion referenced tests. Paper presented at the annual
meeting of the American Educational Research Association, Toronto, March
1978.

Pellegrini, A.D., Long, J.V., and Horwitz, S. An empirical investigation of
the ESEA Title I Evaluation System's proposed variance estimation
procedures and the proposed alternative for estimation of variances in
Title I evaluation models for use with criterion referenced tests. Paper
presented at the annual meeting of the American Educational Research
Association, San Francisco, California, April 1979.

Popham, W.J. Criterion referenced measurement. Englewood Cliffs, N.J.:
Prentice Hall, 1978.


Porter, A.C., Schmidt, W.H., Floden, R.E., & Freeman, D.J. Practical
significance in program evaluation. American Educational Research
Journal, 1978, 15, 529-540.

Storlie, T.R., Rice, W., Harvey, P., & Crane, L. An empirical comparison of
Title I NCE gains with Model A1 and A2. Paper presented at the annual
meeting of the American Educational Research Association, San Francisco,
California, April 1979.

Tallmadge, G.K., and Wood, C.T. User's Guide: ESEA Title I Evaluation and
Reporting System. Mountain View, California: RMC Research Corporation,
1976.

U.S. Department of Health, Education & Welfare. A practical guide to
measuring project impact on student achievement. Washington, D.C.: U.S.
Department of Health, Education & Welfare/Office of Education, undated.

U.S. Department of Education/RMC Research Corporation. Evaluator's references:
Title I Evaluation and Reporting System, Volume II. Washington, D.C.:
U.S. Government Office, 1981.






